library(tidyverse) # for graphing and data cleaning
library(googlesheets4) # for reading googlesheet data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
gs4_deauth() # To not have to authorize each time you knit.
theme_set(theme_minimal()) # My favorite ggplot() theme :)
#Lisa's garden data
garden_harvest <- read_sheet("https://docs.google.com/spreadsheets/d/1DekSazCzKqPS2jnGhKue7tLxRU3GVL1oxi-4bEM5IWw/edit?usp=sharing") %>%
mutate(date = ymd(date))
# Seeds/plants (and other garden supply) costs
supply_costs <- read_sheet("https://docs.google.com/spreadsheets/d/1dPVHwZgR9BxpigbHLnA0U99TtVHHQtUzNB9UR0wvb7o/edit?usp=sharing",
col_types = "ccccnn")
# Planting dates and locations
plant_date_loc <- read_sheet("https://docs.google.com/spreadsheets/d/11YH0NtXQTncQbUse5wOsTtLSKAiNogjUA21jnX5Pnl4/edit?usp=sharing",
col_types = "cccnDlc")%>%
mutate(date = ymd(date))
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week. Display the results so that the vegetables are rows but the days of the week are columns.garden_harvest %>%
mutate(week_day = wday(date, label = TRUE)) %>%
group_by(week_day,vegetable) %>%
mutate(wt_lbs = weight*0.00220462) %>%
summarize(daily_wt_lbs = sum(wt_lbs)) %>%
pivot_wider(id_cols = vegetable,
names_from = week_day,
values_from = daily_wt_lbs,
values_fill = 0)
garden_harvest data to find the total harvest in pound for each vegetable variety and then try adding the plot variable from the plant_date_loc table. This will not turn out perfectly. What is the problem? How might you fix it?garden_summary <- garden_harvest %>%
group_by(vegetable, variety, date) %>%
mutate(wt_lbs = weight*0.00220462) %>%
summarize(daily_wt_lbs = sum(wt_lbs))
garden_summary %>%
left_join(plant_date_loc,
by = c("vegetable", "variety"))
As shown above, there is a replication of certain vegetables and varieties. For example, in row 17 and 18, the beans(Bush Bush Slender variety) harvested on 2020-07-06 are reported as harvested in both plot M and D. However, in reality these vegetables and variaties have not been harvest twice. When Lisa collected her data, she didn’t report the plot where she harvest from. Therefore, there is a replication of certain vegetables and varieties reported while that isn’t accurate. This could be fixed if Lisa would report the plot in which each vegetable and variety is harvested.
garden_harvest and supply_cost datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.First, we would have to determine the harvest yield for each vegetable by summarizing the vegetable data in garden_harvest. We could then find the price of each vegetable on the whole foods website. Next, we could left_join the supply_cost data set to garden_harvest by variety to determine the cost of the supplies for the garden. Finally, using this combined data set we could determine the price differential between the supplies for each vegetable and the retail price from whole foods. This would determine how much money was saved from growing your own crops.
garden_harvest %>%
filter(vegetable == "tomatoes") %>%
group_by(variety) %>%
summarize(first_harvest_date = min(date), total_harvest_lbs = sum(weight) * 0.00220462) %>%
arrange(first_harvest_date) -> ordered_by_date
ordered_by_date
ordered_by_date %>%
ggplot(aes(y = fct_relevel(variety, "grape", "Big Beef", "Bonny Best", "Better Boy", "Cherokee Purple", "Amish Paste", "Mortgage Lifter", "Jet Star", "Old German", "Black Krim", "Brandywine", "volunteers"), x = total_harvest_lbs)) +
geom_col(fill = "darkred") +
labs(title = "Total Harvest Weight of Different Tomato Varieties (Ordered by First Harvest Date)",
x = "Total Harvest Weight (lbs)",
y = "Variety")
garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().garden_harvest %>%
mutate(variety_lower = str_to_lower(variety)) %>%
mutate(variety_length = str_length(variety)) %>%
mutate(variety2 = fct_infreq(variety)) %>%
distinct(vegetable, variety, .keep_all = TRUE) %>%
arrange(vegetable, variety_length)
garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().garden_harvest %>%
mutate(has_r = str_detect(variety, "er") | str_detect(variety, "ar")) %>%
distinct(variety, has_r)
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
{300px}
Two data tables are available:
Trips contains records of individual rentalsStations gives the locations of the bike rental stationsHere is the code to read in the data. We do this a little differently than usualy, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
sdate. Use geom_density().Trips %>%
ggplot(aes(x = sdate)) +
geom_density() +
labs(title = "Event Versus Date",
x = "Date",
y = "Density")
In the density plot above, we observe that the bikes are rented more often during October and November than in December and January. This can be explained by the transition from spring weather into winter weather.
mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
ggplot(aes(x = time_of_day)) +
geom_density() +
labs(title = "Event Versus Time of Day",
x = "Time of Day",
y = "Density")
In the density plot above, we observe the time of the day that bikes are rented out. As shown, bikes are rented out more often during two specific time periods of the day: early morning between 7am and 8am and in the afternoon between 5pm and 6pm. This can be explained by the fact that people use bikes during this time period to go to work in the morning and to go home in the afternoon. This is at the same time as the typical rush hour in public transportation.
Trips %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(y = days_of_week)) +
geom_bar() +
labs(title = "Event Versus Day of the Week",
x = "Count",
y = "Day of the Week")
As shown in the bar graph above, bikes are rented out more during the weekdays than during the weekends. This could be explained by the fact that individuals rent out these bikes to go from point A to point B during the workweek while they are off on the weekends and therefore don’t need to rent out a bike.
Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day)) +
geom_density() +
facet_wrap(~days_of_week) +
labs(title = "Event Versus Time and Day of the Week",
x = "Time of Day",
y = "Density")
There is a clear pattern during the weekdays as well as during the weekends. During the weekdays, there are peaks in the bike rentals during the early morning period before regular work hours start and in the afternoon after regular work hours have ended. Furthermore, both Saturday and Sunday have extremely similar patterns as most people use these bikes during the day (peak at 3pm) instead of during the early morning and/or late afternoon.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Causal). The next set of exercises investigate whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises. Repeat the graphic from Exercise @ref(exr:exr-temp) (d) with the following changes:
fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day, fill=client)) +
geom_density(color="NA", alpha = 0.5) +
facet_wrap(~days_of_week) +
labs(title = "Event Versus Time and Day of the Week",
x = "Time of Day",
y = "Density")
position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day, fill=client)) +
geom_density(color="NA", alpha = 0.5, position = position_stack()) +
facet_wrap(~days_of_week) +
labs(title = "Event Versus Time and Day of the Week",
x = "Time of Day",
y = "Density")
I don’t believe that one plot is better or worse than the other in telling the story because both plots provide different information. In the first plot, we are able to compare the distributions for both registered and casual clients. Furthermore, we are also able to observe how both clients differ in renting out a bike at different times. In the second plot, we are able to show what proportion of the rides is represented by the different type of client. For example, we are able to tell from the second plot that it’s mainly registered clients that rent out bikes on Saturday morning, while we are unable to easily observe this in the first plot.
weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
mutate(weekend = ifelse(days_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday")) %>%
ggplot(aes(x = time_of_day, fill=client), color="NA", alpha = 0.5) +
geom_density(position = position_stack()) +
facet_wrap(~weekend) +
labs(title = "Different Client Usage",
x = "Time of Day",
y = "Density")
client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
mutate(weekend = ifelse(days_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday")) %>%
ggplot(aes(x = time_of_day, fill=days_of_week), color="NA", alpha = 0.5) +
geom_density(position = position_stack()) +
facet_wrap(~client) +
labs(title = "Different Client Usage",
x = "Time of The Day",
y = "Density")
The graph above tells us more specific information about each day of the week which allows us to determine which day shave the most casual and registered users respectively. I do not believe that one graph is better than the other, they both give different information that is useful to answer different questions.
Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!Stations %>%
left_join(Trips,
by = c("name" = "sstation")) %>%
group_by(name) %>%
mutate(total_departures = n()) %>%
ggplot(aes(x = long, y = lat, color = total_departures)) +
geom_jitter() +
labs(title = "Total Departures from Each Rental Location",
x = "Longitude",
y ="Latitude")
Stations %>%
left_join(Trips,
by = c("name" = "sstation")) %>%
group_by(name, long, lat) %>%
summarize(percent_casual= mean(client == "Casual")) %>%
ggplot(aes(x = long, y = lat, color = percent_casual)) +
geom_point() +
labs(title = "Total Departures from Each Rental Location",
x = "Longitude",
y ="Latitute")
There are many stations along the 38.9 latitude, which could be explained by the fact that this is a downtown area in the middle of a city, with metro and train stations located in this area. Furthermore, there seems to be a diagonal street from left top to bottom right with the amount of stations that fit into that pattern. Lastly, there seems to be a high percentage of casual riders at certain stations along the -77.05 longitude. Once again, this is possibly the center of the downtown area where casual riders rent out their bikes.
as_date(sdate) converts sdate from date-time format to date format.Top_Trips <- Trips %>%
mutate(trip_date = as_date(sdate)) %>%
group_by(sstation, trip_date) %>%
count() %>%
arrange(desc(n)) %>%
head(10)
Top_Trips
Trips%>%
mutate(trip_date = as_date(sdate)) %>%
inner_join(Top_Trips,
by = c("sstation", "trip_date"))
Trips %>%
mutate(trip_date = as_date(sdate)) %>%
inner_join(Top_Trips,
by = c("sstation", "trip_date")) %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
group_by(client, days_of_week) %>%
summarize(number_riders = n()) %>%
mutate(total_prop = number_riders/sum(number_riders)) %>%
pivot_wider(id_cols = days_of_week,
names_from = client,
values_from = total_prop)
From the table above, we are able to conclude that at the ten station-date combinations with the highest number of departures, 56% of casual riders rent out a bike on Saturday and 36% of casual riders rent out a bike on Sunday. This shows that there is a very low percentage of casual riders that rent out a bike at these stations during the weekdays. In comparison, only 10% of registered riders rent out a bike on Saturday while only ~6% of registered riders rent out a bike on Sunday. Therefore, a much higher percentage of registered clients rents out a bike during the weekdays than casual clients.
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.kids %>%
filter(variable %in% "lib") %>%
ggplot(aes(x = year, y = inf_adj_perchild)) +
geom_line(color = "white", size =0.5) +
theme(legend.position = "dodge") +
theme_void() +
theme(plot.background = element_rect(fill = "lightsteelblue4")) +
facet_geo(vars(state), grid = "us_state_grid3", label = "name") +
labs(title="Change in public spending on libraries",
subtitle = "Dollars spent per child, adjusted for inflation")+
theme(plot.title = element_text(hjust = 0.5, size =20, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, size = 15))